SUHASH REDDY IMMAREDDY - 45693242 - DATA-SCIENCE - PORTFOLIO-1

First, we import the packages needed to visualise, analyse, modify, and clean the datasets.

In [1]:
import os  
import matplotlib as mp
import matplotlib.pyplot as plt#-------------------------Importing pyplot as plt
import pandas as pd
from matplotlib import pyplot as plt#--------------------Importing pyplot as plt ------ Can write in this fashion too.
from datetime import timedelta
import seaborn as sns
plt.style.use('seaborn')
%matplotlib inline
In [2]:
os.getcwd()  # shows the current working directory
Out[2]:
'C:\\Users\\suhas\\Documents\\GitHub\\portfolio-2019-suhashimmareddy'
In [3]:
os.chdir("data")
print("""changing the directory to the location where the data-set is available so that now we can read the file""")
os.getcwd() 
changing the directory to the location where the data-set is available so that now we can read the file
Out[3]:
'C:\\Users\\suhas\\Documents\\GitHub\\portfolio-2019-suhashimmareddy\\data'

Importing Strava.csv

In [4]:
strava = pd.read_csv('strava_export.csv', index_col='date', parse_dates=True) #making the date column as index
print(strava.head(2))
                           average_heartrate  average_temp  average_watts  \
date                                                                        
2018-01-02 20:47:51+00:00              100.6          21.0           73.8   
2018-01-04 01:36:53+00:00                NaN          24.0          131.7   

                          device_watts  distance  elapsed_time elevation_gain  \
date                                                                            
2018-01-02 20:47:51+00:00        False      15.2            94       316.00 m   
2018-01-04 01:36:53+00:00        False      18.0            52       236.00 m   

                           kudos  moving_time workout_type  
date                                                        
2018-01-02 20:47:51+00:00     10           73         Ride  
2018-01-04 01:36:53+00:00      5           46         Ride  

We have read the Strava file with the date column as the index; next we convert the timestamps to the Sydney time zone.
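The time-zone step can be sketched on synthetic timestamps (illustrative values, not the actual Strava rows): tz_localize attaches a zone to naive timestamps, while tz_convert shifts already-aware ones.

```python
import pandas as pd

# Synthetic UTC timestamps standing in for the Strava index (illustrative only)
idx = pd.to_datetime(["2018-01-02 20:47:51", "2018-01-04 01:36:53"])

# tz_localize attaches a time zone to naive timestamps
utc = idx.tz_localize("UTC")

# tz_convert shifts already tz-aware timestamps into another zone
sydney = utc.tz_convert("Australia/Sydney")
print(sydney)
```

The first timestamp shifts from 20:47 UTC on 2 January to 07:47 on 3 January Sydney time, matching the converted index shown below.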

In [5]:
strava.index = strava.index.tz_convert('Australia/Sydney')
strava.head()
Out[5]:
average_heartrate average_temp average_watts device_watts distance elapsed_time elevation_gain kudos moving_time workout_type
date
2018-01-03 07:47:51+11:00 100.6 21.0 73.8 False 15.2 94 316.00 m 10 73 Ride
2018-01-04 12:36:53+11:00 NaN 24.0 131.7 False 18.0 52 236.00 m 5 46 Ride
2018-01-04 13:56:00+11:00 83.1 25.0 13.8 False 0.0 3 0.00 m 2 2 Ride
2018-01-04 16:37:04+11:00 110.1 24.0 113.6 False 22.9 77 246.00 m 8 64 Ride
2018-01-06 06:22:46+11:00 110.9 20.0 147.7 True 58.4 189 676.00 m 12 144 Ride
In [6]:
strava.shape
Out[6]:
(268, 10)

Importing Cheetah.csv

In [7]:
cheetah = pd.read_csv('cheetah.csv', skipinitialspace=True)
cheetah.head()
Out[7]:
date time filename axPower aPower Relative Intensity aBikeScore Skiba aVI aPower Response Index aIsoPower aIF ... Rest AVNN Rest SDNN Rest rMSSD Rest PNN50 Rest LF Rest HF HRV Recovery Points NP IF TSS
0 01/28/18 06:39:49 2018_01_28_06_39_49.json 202.211 0.75452 16.6520 1.31920 1.67755 223.621 0.83441 ... 0 0 0 0 0 0 0 222.856 0.83155 20.2257
1 01/28/18 07:01:32 2018_01_28_07_01_32.json 226.039 0.84343 80.2669 1.21137 1.54250 246.185 0.91860 ... 0 0 0 0 0 0 0 245.365 0.91554 94.5787
2 02/01/18 08:13:34 2018_02_01_08_13_34.json 0.000 0.00000 0.0000 0.00000 0.00000 0.000 0.00000 ... 0 0 0 0 0 0 0 0.000 0.00000 0.0000
3 02/06/18 08:06:42 2018_02_06_08_06_42.json 221.672 0.82714 78.8866 1.35775 1.86002 254.409 0.94929 ... 0 0 0 0 0 0 0 253.702 0.94665 98.3269
4 02/07/18 17:59:05 2018_02_07_17_59_05.json 218.211 0.81422 159.4590 1.47188 1.74658 233.780 0.87231 ... 0 0 0 0 0 0 0 232.644 0.86808 171.0780

5 rows × 362 columns

In [8]:
cheetah.index = pd.to_datetime(cheetah['date'] + ' ' + cheetah['time'])
cheetah.index = cheetah.index.tz_localize('Australia/Sydney')
cheetah.head()
Out[8]:
date time filename axPower aPower Relative Intensity aBikeScore Skiba aVI aPower Response Index aIsoPower aIF ... Rest AVNN Rest SDNN Rest rMSSD Rest PNN50 Rest LF Rest HF HRV Recovery Points NP IF TSS
2018-01-28 06:39:49+11:00 01/28/18 06:39:49 2018_01_28_06_39_49.json 202.211 0.75452 16.6520 1.31920 1.67755 223.621 0.83441 ... 0 0 0 0 0 0 0 222.856 0.83155 20.2257
2018-01-28 07:01:32+11:00 01/28/18 07:01:32 2018_01_28_07_01_32.json 226.039 0.84343 80.2669 1.21137 1.54250 246.185 0.91860 ... 0 0 0 0 0 0 0 245.365 0.91554 94.5787
2018-02-01 08:13:34+11:00 02/01/18 08:13:34 2018_02_01_08_13_34.json 0.000 0.00000 0.0000 0.00000 0.00000 0.000 0.00000 ... 0 0 0 0 0 0 0 0.000 0.00000 0.0000
2018-02-06 08:06:42+11:00 02/06/18 08:06:42 2018_02_06_08_06_42.json 221.672 0.82714 78.8866 1.35775 1.86002 254.409 0.94929 ... 0 0 0 0 0 0 0 253.702 0.94665 98.3269
2018-02-07 17:59:05+11:00 02/07/18 17:59:05 2018_02_07_17_59_05.json 218.211 0.81422 159.4590 1.47188 1.74658 233.780 0.87231 ... 0 0 0 0 0 0 0 232.644 0.86808 171.0780

5 rows × 362 columns

We have read the cheetah file, built a datetime index from its date and time columns, and localized it to the Sydney time zone.

We now examine the types of data present in the strava and cheetah files.

In [9]:
print("The shape of the strava table (rows, columns):", strava.shape)
print(strava.get_dtype_counts())
print("\n")
print("getting the datatype for all columns", strava.dtypes)
The shape of the strava table (rows, columns): (268, 10)
float64    4
int64      3
object     3
dtype: int64


getting the datatype for all columns average_heartrate    float64
average_temp         float64
average_watts        float64
device_watts          object
distance             float64
elapsed_time           int64
elevation_gain        object
kudos                  int64
moving_time            int64
workout_type          object
dtype: object
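Note that elevation_gain has object dtype because its values carry a unit suffix such as "316.00 m". A minimal sketch of converting such a column to numeric (hypothetical values mirroring the output above):

```python
import pandas as pd

# Illustrative strings in the same "NNN.NN m" format as elevation_gain
elev = pd.Series(["316.00 m", "236.00 m", "0.00 m"])

# Strip the unit suffix and cast to float so the column becomes numeric
elev_m = elev.str.replace(" m", "", regex=False).astype(float)
print(elev_m)
```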
In [10]:
print("The shape of the cheetah table (rows, columns):", cheetah.shape)
print(cheetah.get_dtype_counts())
print("\n")
print("getting the datatype for all columns", cheetah.dtypes)
The shape of the cheetah table (rows, columns): (251, 362)
float64    167
int64      192
object       3
dtype: int64


getting the datatype for all columns date                                object
time                                object
filename                            object
axPower                            float64
aPower Relative Intensity          float64
aBikeScore                         float64
Skiba aVI                          float64
aPower Response Index              float64
aIsoPower                          float64
aIF                                float64
aBikeStress                        float64
aVI                                float64
aPower Efficiency Factor           float64
aBikeStress per hour               float64
Aerobic Decoupling                 float64
Power Index                        float64
Activities                           int64
To Exhaustion                        int64
Elapsed Time                         int64
Duration                             int64
Time Moving                          int64
Time Carrying (Est)                  int64
Elevation Gain Carrying (Est)      float64
Distance                           float64
Climb Rating                       float64
Athlete Weight                       int64
Athlete Bodyfat                      int64
Athlete Bones                        int64
Athlete Muscles                      int64
Athlete Lean Weight                  int64
                                    ...   
W3 W'bal Work Heavy Fatigue        float64
W4 W'bal Work Severe Fatigue       float64
Below CP Work                      float64
Fraction of normal RR intervals      int64
Average of all NN intervals          int64
Standard deviation of NN             int64
SDANN                                int64
SDNNIDX                              int64
rMSSD                                int64
pNN5                                 int64
pNN10                                int64
pNN15                                int64
pNN20                                int64
pNN25                                int64
pNN30                                int64
pNN35                                int64
pNN40                                int64
pNN45                                int64
pNN50                                int64
Rest HR                              int64
Rest AVNN                            int64
Rest SDNN                            int64
Rest rMSSD                           int64
Rest PNN50                           int64
Rest LF                              int64
Rest HF                              int64
HRV Recovery Points                  int64
NP                                 float64
IF                                 float64
TSS                                float64
Length: 362, dtype: object

We perform an inner join so that we keep only the records that are present in both dataframes, strava and cheetah.

In [11]:
cheetah_strava_innerjoin = pd.merge(left = cheetah, right = strava, left_index = True, right_index= True, how='inner')
print("The new shape of the joined table is :",cheetah_strava_innerjoin.shape)
cheetah_strava_innerjoin.head()
The new shape of the joined table is : (243, 372)
Out[11]:
date time filename axPower aPower Relative Intensity aBikeScore Skiba aVI aPower Response Index aIsoPower aIF ... average_heartrate average_temp average_watts device_watts distance elapsed_time elevation_gain kudos moving_time workout_type
2018-01-28 06:39:49+11:00 01/28/18 06:39:49 2018_01_28_06_39_49.json 202.211 0.75452 16.6520 1.31920 1.67755 223.621 0.83441 ... 120.6 21.0 153.4 True 7.6 17 95.00 m 4 17 Ride
2018-01-28 07:01:32+11:00 01/28/18 07:01:32 2018_01_28_07_01_32.json 226.039 0.84343 80.2669 1.21137 1.54250 246.185 0.91860 ... 146.9 22.0 187.7 True 38.6 67 449.00 m 19 67 Race
2018-02-01 08:13:34+11:00 02/01/18 08:13:34 2018_02_01_08_13_34.json 0.000 0.00000 0.0000 0.00000 0.00000 0.000 0.00000 ... 109.8 19.0 143.0 False 26.3 649 612.00 m 6 113 Ride
2018-02-06 08:06:42+11:00 02/06/18 08:06:42 2018_02_06_08_06_42.json 221.672 0.82714 78.8866 1.35775 1.86002 254.409 0.94929 ... 119.3 19.0 165.9 True 24.3 69 439.00 m 6 65 Ride
2018-02-07 17:59:05+11:00 02/07/18 17:59:05 2018_02_07_17_59_05.json 218.211 0.81422 159.4590 1.47188 1.74658 233.780 0.87231 ... 124.8 20.0 151.0 True 47.1 144 890.00 m 10 134 Ride

5 rows × 372 columns

In [12]:
print("cheetah_strava_innerjoin-count--------------------------------------------------------------")
total_row_count   = cheetah_strava_innerjoin.count()
print(total_row_count.head(5))
print("strava-count--------------------------------------------------------------------------------")
total_row_count1 = strava.count()
print(total_row_count1.head(5))
print("cheetah-count-------------------------------------------------------------------------------")
total_row_count2 = cheetah.count()
print(total_row_count2.head(5))
cheetah_strava_innerjoin-count--------------------------------------------------------------
date                         243
time                         243
filename                     243
axPower                      243
aPower Relative Intensity    243
dtype: int64
strava-count--------------------------------------------------------------------------------
average_heartrate    232
average_temp         204
average_watts        254
device_watts         260
distance             268
dtype: int64
cheetah-count-------------------------------------------------------------------------------
date                         251
time                         251
filename                     251
axPower                      251
aPower Relative Intensity    251
dtype: int64

So after joining the two tables on their index, we obtained a new table containing only the data common to both datasets. Comparing counts, some rows were dropped from the joined table relative to the individual tables, which show column counts greater than 243.
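To audit exactly which rows an inner join discards, an outer join with indicator=True can be used; a minimal sketch on toy frames (hypothetical values, not the real tables):

```python
import pandas as pd

# Tiny stand-ins for the two datetime-indexed tables (hypothetical values)
left = pd.DataFrame({"a": [1, 2, 3]},
                    index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]))
right = pd.DataFrame({"b": [10, 30]},
                     index=pd.to_datetime(["2018-01-01", "2018-01-03"]))

# indicator=True adds a _merge column saying which side each row came from
audit = pd.merge(left, right, left_index=True, right_index=True,
                 how="outer", indicator=True)

# Rows present only on the left are exactly what an inner join would drop
left_only = audit[audit["_merge"] == "left_only"]
print(left_only)
```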

1. Remove rides with no measured power (where device_watts is False) - these are commutes or MTB rides

In [13]:
cheetah_strava_innerjoin['device_watts'].head()
Out[13]:
2018-01-28 06:39:49+11:00     True
2018-01-28 07:01:32+11:00     True
2018-02-01 08:13:34+11:00    False
2018-02-06 08:06:42+11:00     True
2018-02-07 17:59:05+11:00     True
Name: device_watts, dtype: object

We check the type of data held by the device_watts column, which is Boolean (True or False), so we can drop the rows that contain the value False in the device_watts column.

Dropping a row by condition

The syntax is: df[df.Name != 'Alisa']

In [14]:
cheetah_strava_innerjoin = cheetah_strava_innerjoin[cheetah_strava_innerjoin.device_watts != False]
In [15]:
print("The shape of the joined table after dropping rows where device_watts is False:", cheetah_strava_innerjoin.shape)
The shape of the joined table after dropping rows where device_watts is False: (209, 372)

After dropping the rows where device_watts is False, the row count falls from 243 to 209.
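Because device_watts has object dtype rather than bool, the comparison != False relies on the values being genuine Python booleans. A more defensive sketch (hypothetical values) coerces the column first:

```python
import pandas as pd

# Object-dtype flag column like device_watts (hypothetical values)
dw = pd.Series([True, False, True], dtype=object)

# Coerce to a real boolean dtype, then keep only the True rows
mask = dw.astype(bool)
kept = dw[mask]
print(kept)
```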

2. Look at the distributions of some key variables: time, distance, average speed, average power, TSS. Are they normally distributed? Skewed?

There are four time-related variables (time, elapsed_time, moving_time, Time Moving), so we plot the distribution graphs for each of them.

In [16]:
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

cheetah_strava_innerjoin['time'] = pd.to_datetime(cheetah_strava_innerjoin['time']).dt.strftime('%H:%M:%S')
sns.scatterplot(cheetah_strava_innerjoin['time'],cheetah_strava_innerjoin.index)
plt.show()

Plotting the time of day against the date index, the points roughly follow a straight band (excluding a few outliers), which suggests the ride start times are fairly regular; this scatter is only a rough visual check, not a formal test of normality.

In [17]:
def skewness(x):
    # Third-moment skewness estimate: sum((x - mean)^3) / (n * std^3)
    m = x.mean()
    s = x.std()
    res = 0
    for i in x:
        res += (i - m) ** 3
    return res / (len(x) * s ** 3)
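As a cross-check (a sketch on made-up numbers, not the portfolio data), pandas provides Series.skew(), which applies the adjusted Fisher-Pearson correction and therefore differs slightly from a plain moment-based estimate like the function above:

```python
import pandas as pd

# Made-up sample, positively skewed by construction
x = pd.Series([1.0, 2.0, 2.0, 3.0, 7.0])

def skewness(x):
    # Same moment-based estimate as above: sum((x - mean)^3) / (n * std^3)
    m, s, n = x.mean(), x.std(), len(x)
    return ((x - m) ** 3).sum() / (n * s ** 3)

g = skewness(x)          # hand-rolled estimate
g_pandas = x.skew()      # bias-adjusted Fisher-Pearson estimate
print(g, g_pandas)
```

Both estimates agree on the sign, which is what the commentary below relies on.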
In [18]:
sns.distplot(cheetah_strava_innerjoin.moving_time)
print("Skewness of the moving_time = ",skewness(cheetah_strava_innerjoin.moving_time))
Skewness of the moving_time =  0.5471632744534792

The distribution of moving_time is bimodal and has positive skewness, meaning it is skewed right.

In [19]:
sns.distplot(cheetah_strava_innerjoin.elapsed_time)
print("Skewness of the elapsed_time = ",skewness(cheetah_strava_innerjoin.elapsed_time))
Skewness of the elapsed_time =  0.63069381841196

The distribution of elapsed_time is bimodal and has positive skewness, meaning it is skewed right.

In [20]:
sns.distplot(cheetah_strava_innerjoin['Time Moving'])
print("Skewness of the Time Moving = ",skewness(cheetah_strava_innerjoin['Time Moving']))
Skewness of the Time Moving =  0.5541628398239516

The distribution of Time Moving is bimodal and has positive skewness, meaning it is skewed right.

In [21]:
sns.distplot(cheetah_strava_innerjoin.distance)
print("Skewness of the distance = ",skewness(cheetah_strava_innerjoin.distance))
Skewness of the distance =  0.5124316095249656

The distribution of distance is bimodal and has positive skewness, meaning it is skewed right.

In [22]:
sns.distplot(cheetah_strava_innerjoin['Average Power'])
print("Skewness of the Average Power = ",skewness(cheetah_strava_innerjoin['Average Power']))
Skewness of the Average Power =  -0.8263555425244128

The distribution of Average Power is bimodal and has negative skewness, meaning it is skewed left.

In [23]:
cheetah_strava_innerjoin['average_watts'] = cheetah_strava_innerjoin['average_watts'].fillna(0.0)
sns.distplot(cheetah_strava_innerjoin.average_watts)
print("Skewness of the Average watts = ",skewness(cheetah_strava_innerjoin['average_watts']))
Skewness of the Average watts =  -0.9106983699735207

The distribution of average_watts is bimodal and has negative skewness, meaning it is skewed left; this is mainly due to the outlier, and excluding it the distribution looks closer to normal.

In [24]:
sns.distplot(cheetah_strava_innerjoin.TSS)
print("Skewness of the TSS = ",skewness(cheetah_strava_innerjoin['TSS']))
Skewness of the TSS =  1.0406742960001307

The distribution of TSS is bimodal and has positive skewness, meaning it is skewed right.

In [25]:
sns.distplot(cheetah_strava_innerjoin['Average Speed'])
print("Skewness of the Average Speed = ",skewness(cheetah_strava_innerjoin['Average Speed']))
Skewness of the Average Speed =  -0.559270091530111

The distribution of Average Speed is multimodal and has negative skewness, meaning it is skewed left; if we exclude the outlier, it looks bimodal.
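All of the per-column skewness checks above can also be summarised in one call with DataFrame.skew(); a sketch on a hypothetical frame standing in for the joined table:

```python
import pandas as pd

# Hypothetical stand-in for the joined table's numeric columns
df = pd.DataFrame({
    "moving_time": [17, 67, 65, 134, 139],
    "distance": [7.6, 38.6, 24.3, 47.1, 59.8],
})

# One skewness value per numeric column
skews = df.skew()
print(skews)
```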

3. Explore the relationships between the following variables. Are any of them correlated with each other (do they vary together in a predictable way)? Can you explain any relationships you observe?

Distance

Moving Time

Average Speed

Heart Rate

Power (watts)

Normalised power (NP)

Training Stress Score

Elevation Gain

In [26]:
C_S_I = cheetah_strava_innerjoin[["distance","moving_time","Average Speed","average_heartrate","Average Power","NP","TSS","Elevation Gain"]]
In [27]:
C_S_I.head()
Out[27]:
distance moving_time Average Speed average_heartrate Average Power NP TSS Elevation Gain
2018-01-28 06:39:49+11:00 7.6 17 26.0234 120.6 153.283 222.856 20.2257 77.8
2018-01-28 07:01:32+11:00 38.6 67 34.4380 146.9 186.599 245.365 94.5787 362.2
2018-02-06 08:06:42+11:00 24.3 65 22.2417 119.3 163.264 253.702 98.3269 355.8
2018-02-07 17:59:05+11:00 47.1 134 20.7841 124.8 148.253 232.644 171.0780 815.4
2018-02-10 06:18:03+11:00 59.8 139 25.6585 123.0 143.918 212.726 147.7970 513.2
In [28]:
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

sns.pairplot(C_S_I)
Out[28]:
<seaborn.axisgrid.PairGrid at 0x1f1b68c4978>
In [29]:
CORRELATION = C_S_I.corr()
In [30]:
CORRELATION
Out[30]:
distance moving_time Average Speed average_heartrate Average Power NP TSS Elevation Gain
distance 1.000000 0.939383 0.187363 0.114595 0.129199 0.270703 0.922565 0.805468
moving_time 0.939383 1.000000 -0.103484 -0.048611 -0.109838 0.044431 0.871368 0.813146
Average Speed 0.187363 -0.103484 1.000000 0.742388 0.814403 0.674857 0.134054 -0.016160
average_heartrate 0.114595 -0.048611 0.742388 1.000000 0.692413 0.593091 0.113775 0.071553
Average Power 0.129199 -0.109838 0.814403 0.692413 1.000000 0.844487 0.225290 -0.035987
NP 0.270703 0.044431 0.674857 0.593091 0.844487 1.000000 0.432286 0.229933
TSS 0.922565 0.871368 0.134054 0.113775 0.225290 0.432286 1.000000 0.828928
Elevation Gain 0.805468 0.813146 -0.016160 0.071553 -0.035987 0.229933 0.828928 1.000000

From the graph and the correlation matrix we can see that:

With respect to distance -> it is highly correlated with moving_time, followed by TSS and Elevation Gain.

With respect to moving_time -> it is highly correlated with distance, followed by TSS and Elevation Gain.

With respect to Average Speed -> it is highly correlated with Average Power, followed by average_heartrate.

With respect to average_heartrate -> it is highly correlated with Average Speed, followed by Average Power.

With respect to Average Power -> it is highly correlated with NP, followed by Average Speed.

With respect to NP -> it is highly correlated with Average Power.

With respect to TSS -> it is highly correlated with distance, followed by moving_time and Elevation Gain.

With respect to Elevation Gain -> it is highly correlated with TSS, followed by moving_time and distance.
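Reading off the strongest partner for each variable can be automated by masking the diagonal of the correlation matrix; a sketch on toy data (hypothetical values, not the ride columns):

```python
import pandas as pd

# Toy frame: b tracks a closely, c moves against both (hypothetical values)
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 5, 9], "c": [4, 3, 2, 2]})

corr = df.corr()

# Mask the self-correlations of 1.0, then take the strongest |r| per column
top = corr.where(corr < 1.0).abs().idxmax()
print(top)
```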

4. We want to explore the differences between the three categories: Race, Workout and Ride.

4a. Use scatter plots with different colours for each category to explore how these categories differ.

4b. Use histograms or box plots to visualise the different distributions of a variable for the three categories.

4c. In both cases, experiment with different variables but only include those that are interesting in your final notebook (if none are interesting, show us a representative example).

In [31]:
C_S_I1 = cheetah_strava_innerjoin[["distance","workout_type","moving_time","Average Speed","average_heartrate","Average Power","NP","TSS","Elevation Gain"]]
In [32]:
C_S_I1.head()
Out[32]:
distance workout_type moving_time Average Speed average_heartrate Average Power NP TSS Elevation Gain
2018-01-28 06:39:49+11:00 7.6 Ride 17 26.0234 120.6 153.283 222.856 20.2257 77.8
2018-01-28 07:01:32+11:00 38.6 Race 67 34.4380 146.9 186.599 245.365 94.5787 362.2
2018-02-06 08:06:42+11:00 24.3 Ride 65 22.2417 119.3 163.264 253.702 98.3269 355.8
2018-02-07 17:59:05+11:00 47.1 Ride 134 20.7841 124.8 148.253 232.644 171.0780 815.4
2018-02-10 06:18:03+11:00 59.8 Ride 139 25.6585 123.0 143.918 212.726 147.7970 513.2

Swarm plots show us the data distribution for each category of a variable.
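Alongside the plots, a per-category numeric summary makes the comparison concrete; a sketch on hypothetical rows in the same shape as the joined table:

```python
import pandas as pd

# Hypothetical rows in the style of the joined table
df = pd.DataFrame({
    "workout_type": ["Ride", "Race", "Ride", "Workout"],
    "average_heartrate": [120.6, 146.9, 119.3, 131.0],
})

# Mean heart rate per category, the numeric counterpart of the box plot
summary = df.groupby("workout_type")["average_heartrate"].mean()
print(summary)
```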

In [33]:
sns.catplot(x='workout_type',y='average_heartrate',kind='swarm',data=C_S_I1)
plt.grid(True)
In [34]:
sns.catplot(x='workout_type',y='average_heartrate',kind='box',data=C_S_I1)
plt.grid(True)
In [35]:
sns.catplot(x='workout_type',y='distance',kind='swarm',data=C_S_I1)
plt.grid(True)
In [36]:
sns.catplot(x='workout_type',y='distance',kind='box',data=C_S_I1)
plt.grid(True)
In [37]:
#hue='workout_type'
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
sns.pairplot(C_S_I1, hue = 'workout_type')
Out[37]:
<seaborn.axisgrid.PairGrid at 0x1f1bac5b2e8>

distance -> moving_time, TSS and Elevation Gain (Ride increases with increase in distance)

moving_time -> distance, TSS and Elevation Gain (Ride increases with increase in moving_time)

Average Speed -> average_heartrate, Average Power, NP (Race increases with increase in Average Speed)

average_heartrate -> Average Speed, Average Power, NP (both Race and Ride increase with increase in average_heartrate)

NP -> TSS (Ride increases with increase in NP)

NP -> Elevation Gain (Ride increases with increase in NP)

NP -> average_heartrate (Race increases with increase in NP)

NP -> Average Power (Race increases with increase in NP)

TSS -> Elevation Gain (Ride increases with increase in TSS)

Elevation Gain -> Ride increases with increase in Elevation Gain against all other variables

Challenges

1. What leads to more kudos? Is there anything to indicate which rides are more popular? Explore the relationship between the main variables and kudos. Show a plot and comment on any relationship you observe

In [38]:
#what leads to more kudos
C_S_I2 = cheetah_strava_innerjoin[["workout_type","kudos","distance","moving_time"]]
sns.catplot(x="workout_type", y="kudos", hue="workout_type", kind="box", data=C_S_I2)
Out[38]:
<seaborn.axisgrid.FacetGrid at 0x1f1bca13908>
In [39]:
sns.catplot(x="workout_type", y="kudos", hue="workout_type", kind="bar", data=C_S_I2)
Out[39]:
<seaborn.axisgrid.FacetGrid at 0x1f1b9189dd8>

We can see from both the boxplot and the bar chart that the Race workout_type leads to more kudos.
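The boxplot's reading can be confirmed numerically with a groupby; a sketch on hypothetical kudos counts (not the real data):

```python
import pandas as pd

# Hypothetical kudos counts per ride
df = pd.DataFrame({
    "workout_type": ["Ride", "Race", "Ride", "Race"],
    "kudos": [10, 19, 5, 25],
})

# Mean kudos per category, highest first
mean_kudos = df.groupby("workout_type")["kudos"].mean().sort_values(ascending=False)
print(mean_kudos)
```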

In [40]:
C_S_I1 = cheetah_strava_innerjoin[["distance","workout_type","kudos","moving_time","Average Speed","average_heartrate","Average Power","NP","TSS","Elevation Gain"]]
In [41]:
CORRELATION = C_S_I1.corr()
In [42]:
CORRELATION
Out[42]:
distance kudos moving_time Average Speed average_heartrate Average Power NP TSS Elevation Gain
distance 1.000000 0.753808 0.939383 0.187363 0.114595 0.129199 0.270703 0.922565 0.805468
kudos 0.753808 1.000000 0.663127 0.393050 0.387071 0.264819 0.353419 0.694799 0.637602
moving_time 0.939383 0.663127 1.000000 -0.103484 -0.048611 -0.109838 0.044431 0.871368 0.813146
Average Speed 0.187363 0.393050 -0.103484 1.000000 0.742388 0.814403 0.674857 0.134054 -0.016160
average_heartrate 0.114595 0.387071 -0.048611 0.742388 1.000000 0.692413 0.593091 0.113775 0.071553
Average Power 0.129199 0.264819 -0.109838 0.814403 0.692413 1.000000 0.844487 0.225290 -0.035987
NP 0.270703 0.353419 0.044431 0.674857 0.593091 0.844487 1.000000 0.432286 0.229933
TSS 0.922565 0.694799 0.871368 0.134054 0.113775 0.225290 0.432286 1.000000 0.828928
Elevation Gain 0.805468 0.637602 0.813146 -0.016160 0.071553 -0.035987 0.229933 0.828928 1.000000
In [43]:
sns.pairplot(C_S_I2, hue = 'workout_type')
Out[43]:
<seaborn.axisgrid.PairGrid at 0x1f1bcbd7550>
In [44]:
CORRELATION2 = C_S_I2.corr()
In [45]:
CORRELATION2
Out[45]:
kudos distance moving_time
kudos 1.000000 0.753808 0.663127
distance 0.753808 1.000000 0.939383
moving_time 0.663127 0.939383 1.000000

From the correlation matrix we can see that distance explains kudos better than moving_time and TSS, so we include only distance and moving_time in the final report and implement the pairplot.

The number of kudos increases with distance and TSS, but distance is the main variable.

Challenge 2

In [46]:
suhash = pd.DataFrame()
In [47]:
suhash["date"] = cheetah_strava_innerjoin.date
In [48]:
suhash["distance"] = cheetah_strava_innerjoin.distance
In [49]:
suhash["TSS"] = cheetah_strava_innerjoin.TSS
In [50]:
suhash["AverageSpeed"] = cheetah_strava_innerjoin["Average Speed"]
In [51]:
suhash.head()
Out[51]:
date distance TSS AverageSpeed
2018-01-28 06:39:49+11:00 01/28/18 7.6 20.2257 26.0234
2018-01-28 07:01:32+11:00 01/28/18 38.6 94.5787 34.4380
2018-02-06 08:06:42+11:00 02/06/18 24.3 98.3269 22.2417
2018-02-07 17:59:05+11:00 02/07/18 47.1 171.0780 20.7841
2018-02-10 06:18:03+11:00 02/10/18 59.8 147.7970 25.6585
In [52]:
from datetime import datetime
from datetime import timedelta
In [53]:
suhash['date'] = pd.to_datetime(suhash['date'])
In [54]:
suhash.head()
Out[54]:
date distance TSS AverageSpeed
2018-01-28 06:39:49+11:00 2018-01-28 7.6 20.2257 26.0234
2018-01-28 07:01:32+11:00 2018-01-28 38.6 94.5787 34.4380
2018-02-06 08:06:42+11:00 2018-02-06 24.3 98.3269 22.2417
2018-02-07 17:59:05+11:00 2018-02-07 47.1 171.0780 20.7841
2018-02-10 06:18:03+11:00 2018-02-10 59.8 147.7970 25.6585
In [55]:
suhash1 = suhash.reset_index(drop=True)
In [56]:
suhash1.head()
Out[56]:
date distance TSS AverageSpeed
0 2018-01-28 7.6 20.2257 26.0234
1 2018-01-28 38.6 94.5787 34.4380
2 2018-02-06 24.3 98.3269 22.2417
3 2018-02-07 47.1 171.0780 20.7841
4 2018-02-10 59.8 147.7970 25.6585
In [57]:
Distance_Month = suhash1.set_index('date').groupby(pd.Grouper(freq='M'))[['distance','TSS']].sum()
In [58]:
Distance_Month["Average_Speed"] = suhash1.set_index('date').groupby(pd.Grouper(freq='M'))['AverageSpeed'].mean()
In [59]:
Distance_Month = Distance_Month.reset_index()
In [60]:
Distance_Month
Out[60]:
date distance TSS Average_Speed
0 2018-01-31 46.2 114.8044 30.230700
1 2018-02-28 360.9 1087.2924 24.103918
2 2018-03-31 468.0 1381.0867 25.128942
3 2018-04-30 450.2 1324.5363 22.974325
4 2018-05-31 273.3 718.8654 24.518643
5 2018-06-30 193.4 586.4858 27.037100
6 2018-07-31 180.7 381.4320 23.512267
7 2018-08-31 125.1 370.5251 25.134067
8 2018-09-30 184.8 627.2077 26.435125
9 2018-10-31 417.4 1023.1827 25.401967
10 2018-11-30 671.9 1793.3806 25.143025
11 2018-12-31 523.1 1498.1566 26.416408
12 2019-01-31 380.1 1057.9363 20.976964
13 2019-02-28 482.4 1269.9272 26.247547
14 2019-03-31 488.1 1611.9748 27.005531
15 2019-04-30 612.9 1609.1064 24.161920
16 2019-05-31 566.6 1591.1052 26.868062
17 2019-06-30 516.5 1467.1178 26.736235
18 2019-07-31 367.2 1096.1419 26.565400
In [61]:
plt.figure(figsize = (10,5))
plt.plot(Distance_Month["date"], Distance_Month["distance"], marker = '*')
plt.title("Distance Travelled by month")
Out[61]:
Text(0.5, 1.0, 'Distance Travelled by month')
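The Grouper-based monthly totals above can equivalently be written with resample on a datetime index; a sketch on synthetic rides (hypothetical values):

```python
import pandas as pd

# Synthetic rides: two in January, one in February (hypothetical values)
rides = pd.DataFrame(
    {"distance": [7.6, 38.6, 24.3]},
    index=pd.to_datetime(["2018-01-28", "2018-01-28", "2018-02-06"]),
)

# Month-end buckets, summing the distance per month
monthly = rides.resample("M")["distance"].sum()
print(monthly)
```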
In [ ]:
 

SUHASH REDDY IMMAREDDY - 45693242 - DATA-SCIENCE - PORTFOLIO-2

In [62]:
import os
In [65]:
os.getcwd()
Out[65]:
'C:\\Users\\suhas\\Documents\\GitHub\\portfolio-2019-suhashimmareddy\\data'
In [70]:
os.chdir("C:\\Users\\suhas\\Documents\\GitHub\\portfolio-2019-suhashimmareddy\\Appliances-energy-prediction-data")
In [71]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pylab as plt
from matplotlib.colors import ListedColormap
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.feature_selection import RFE
from datetime import datetime
from datetime import timedelta
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor

Reading the data and understanding its structure

In [72]:
edc = pd.read_csv('energydata_complete.csv')
In [73]:
edc.head()
Out[73]:
date Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 ... T9 RH_9 T_out Press_mm_hg RH_out Windspeed Visibility Tdewpoint rv1 rv2
0 2016-01-11 17:00:00 60 30 19.89 47.596667 19.2 44.790000 19.79 44.730000 19.000000 ... 17.033333 45.53 6.600000 733.5 92.0 7.000000 63.000000 5.3 13.275433 13.275433
1 2016-01-11 17:10:00 60 30 19.89 46.693333 19.2 44.722500 19.79 44.790000 19.000000 ... 17.066667 45.56 6.483333 733.6 92.0 6.666667 59.166667 5.2 18.606195 18.606195
2 2016-01-11 17:20:00 50 30 19.89 46.300000 19.2 44.626667 19.79 44.933333 18.926667 ... 17.000000 45.50 6.366667 733.7 92.0 6.333333 55.333333 5.1 28.642668 28.642668
3 2016-01-11 17:30:00 50 40 19.89 46.066667 19.2 44.590000 19.79 45.000000 18.890000 ... 17.000000 45.40 6.250000 733.8 92.0 6.000000 51.500000 5.0 45.410389 45.410389
4 2016-01-11 17:40:00 60 40 19.89 46.333333 19.2 44.530000 19.79 45.000000 18.890000 ... 17.000000 45.40 6.133333 733.9 92.0 5.666667 47.666667 4.9 10.084097 10.084097

5 rows × 29 columns

In [74]:
edc.shape
Out[74]:
(19735, 29)
In [75]:
edc.describe()
Out[75]:
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 RH_4 ... T9 RH_9 T_out Press_mm_hg RH_out Windspeed Visibility Tdewpoint rv1 rv2
count 19735.000000 19735.000000 19735.000000 19735.000000 19735.000000 19735.000000 19735.000000 19735.000000 19735.000000 19735.000000 ... 19735.000000 19735.000000 19735.000000 19735.000000 19735.000000 19735.000000 19735.000000 19735.000000 19735.000000 19735.000000
mean 97.694958 3.801875 21.686571 40.259739 20.341219 40.420420 22.267611 39.242500 20.855335 39.026904 ... 19.485828 41.552401 7.411665 755.522602 79.750418 4.039752 38.330834 3.760707 24.988033 24.988033
std 102.524891 7.935988 1.606066 3.979299 2.192974 4.069813 2.006111 3.254576 2.042884 4.341321 ... 2.014712 4.151497 5.317409 7.399441 14.901088 2.451221 11.794719 4.194648 14.496634 14.496634
min 10.000000 0.000000 16.790000 27.023333 16.100000 20.463333 17.200000 28.766667 15.100000 27.660000 ... 14.890000 29.166667 -5.000000 729.300000 24.000000 0.000000 1.000000 -6.600000 0.005322 0.005322
25% 50.000000 0.000000 20.760000 37.333333 18.790000 37.900000 20.790000 36.900000 19.530000 35.530000 ... 18.000000 38.500000 3.666667 750.933333 70.333333 2.000000 29.000000 0.900000 12.497889 12.497889
50% 60.000000 0.000000 21.600000 39.656667 20.000000 40.500000 22.100000 38.530000 20.666667 38.400000 ... 19.390000 40.900000 6.916667 756.100000 83.666667 3.666667 40.000000 3.433333 24.897653 24.897653
75% 100.000000 0.000000 22.600000 43.066667 21.500000 43.260000 23.290000 41.760000 22.100000 42.156667 ... 20.600000 44.338095 10.408333 760.933333 91.666667 5.500000 40.000000 6.566667 37.583769 37.583769
max 1080.000000 70.000000 26.260000 63.360000 29.856667 56.026667 29.236000 50.163333 26.200000 51.090000 ... 24.500000 53.326667 26.100000 772.300000 100.000000 14.000000 66.000000 15.500000 49.996530 49.996530

8 rows × 28 columns

In [76]:
edc.columns
Out[76]:
Index(['date', 'Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3',
       'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8',
       'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed',
       'Visibility', 'Tdewpoint', 'rv1', 'rv2'],
      dtype='object')
In [77]:
edc.dtypes
Out[77]:
date            object
Appliances       int64
lights           int64
T1             float64
RH_1           float64
T2             float64
RH_2           float64
T3             float64
RH_3           float64
T4             float64
RH_4           float64
T5             float64
RH_5           float64
T6             float64
RH_6           float64
T7             float64
RH_7           float64
T8             float64
RH_8           float64
T9             float64
RH_9           float64
T_out          float64
Press_mm_hg    float64
RH_out         float64
Windspeed      float64
Visibility     float64
Tdewpoint      float64
rv1            float64
rv2            float64
dtype: object

Data exploration - we add some new columns to better understand the data

In [78]:
edc1 = edc.copy()  # take a copy so the original dataframe is kept as a backup (plain assignment would only alias it)
In [79]:
edc1['date'] = pd.to_datetime(edc1['date'])
In [80]:
len(edc1)
Out[80]:
19735
In [81]:
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")
    
month   = []
hours   = []
week    = [] 
Day     = []

for i in range(len(edc1)):
    month.append(edc1['date'][i].month_name())
    hours.append(edc1['date'][i])
    week.append(edc1['date'][i].week)
    Day.append(edc1['date'][i].dayofweek)
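The Python-level loop above works but is slow for roughly 20,000 rows; the same columns can be derived in one pass with the pandas `.dt` accessor. A minimal sketch on a tiny synthetic frame standing in for `edc1` (the `isocalendar()` call replaces `.dt.week`, which is deprecated in newer pandas):

```python
import pandas as pd

# Small synthetic frame standing in for edc1 (10-minute readings)
df = pd.DataFrame({"date": pd.date_range("2016-01-11 17:00", periods=4, freq="10min")})

# Vectorised equivalents of the month/hours/week/Day loop
df["Month"] = df["date"].dt.month_name()
df["Hour"] = df["date"].dt.floor("h")
df["week"] = df["date"].dt.isocalendar().week
df["Day"] = df["date"].dt.dayofweek

print(df[["Month", "week", "Day"]].iloc[0].tolist())
```

This produces the same values as the loop (January, week 2, Monday for the first reading) without iterating row by row.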
In [82]:
a = set(month)
In [83]:
a
Out[83]:
{'April', 'February', 'January', 'March', 'May'}
In [84]:
#Convert to Series

A=pd.Series(month)
B=pd.Series(hours).dt.floor("H") #floor the hour
C=pd.Series(week)
D=pd.Series(Day)

#adding the series to the dataframe            
edc1['Month']=A            
edc1['Hour']=B            
edc1['week']=C                 
edc1['Day']=D
edc1['Time'] = edc1.Hour.dt.hour
In [85]:
edc1.columns
Out[85]:
Index(['date', 'Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3',
       'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8',
       'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed',
       'Visibility', 'Tdewpoint', 'rv1', 'rv2', 'Month', 'Hour', 'week', 'Day',
       'Time'],
      dtype='object')
In [86]:
edc1.head()
Out[86]:
date Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 ... Windspeed Visibility Tdewpoint rv1 rv2 Month Hour week Day Time
0 2016-01-11 17:00:00 60 30 19.89 47.596667 19.2 44.790000 19.79 44.730000 19.000000 ... 7.000000 63.000000 5.3 13.275433 13.275433 January 2016-01-11 17:00:00 2 0 17
1 2016-01-11 17:10:00 60 30 19.89 46.693333 19.2 44.722500 19.79 44.790000 19.000000 ... 6.666667 59.166667 5.2 18.606195 18.606195 January 2016-01-11 17:00:00 2 0 17
2 2016-01-11 17:20:00 50 30 19.89 46.300000 19.2 44.626667 19.79 44.933333 18.926667 ... 6.333333 55.333333 5.1 28.642668 28.642668 January 2016-01-11 17:00:00 2 0 17
3 2016-01-11 17:30:00 50 40 19.89 46.066667 19.2 44.590000 19.79 45.000000 18.890000 ... 6.000000 51.500000 5.0 45.410389 45.410389 January 2016-01-11 17:00:00 2 0 17
4 2016-01-11 17:40:00 60 40 19.89 46.333333 19.2 44.530000 19.79 45.000000 18.890000 ... 5.666667 47.666667 4.9 10.084097 10.084097 January 2016-01-11 17:00:00 2 0 17

5 rows × 34 columns

In [87]:
edc1.shape
Out[87]:
(19735, 34)
In [88]:
b = set(Day)
In [89]:
b
Out[89]:
{0, 1, 2, 3, 4, 5, 6}

0 is Monday, 1 is Tuesday, 2 is Wednesday, 3 is Thursday, 4 is Friday, 5 is Saturday, 6 is Sunday
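These numeric codes follow pandas' `dayofweek` convention (Monday = 0). Rather than remembering the mapping, the labels can be made explicit; a small sketch:

```python
import pandas as pd

day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

codes = pd.Series([0, 3, 6])  # sample dayofweek codes
labels = codes.map(lambda d: day_names[d])
print(labels.tolist())  # ['Monday', 'Thursday', 'Sunday']
```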

In [90]:
plt.figure(figsize=(20, 7))
sns.distplot(edc1.Appliances)
Out[90]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f1bf773518>
In [91]:
plt.figure(figsize=(20, 7))
plt.hist(edc1["Appliances"], bins='auto', color='#ff0000', alpha=0.7, rwidth=0.85)
plt.show()
In [92]:
plt.figure(figsize=(16, 6))
sns.set(style="whitegrid")
sns.boxplot(edc1["Appliances"])
Out[92]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f1c02beba8>
In [93]:
# box plot of Appliances energy consumption, grouped by month
plt.figure(figsize=(20, 7))
sns.boxplot(x="Appliances", y="Month", data=edc1)
Out[93]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f1c0418f60>
In [94]:
plt.figure(figsize=(20, 7))
ax1 = sns.lineplot(data=edc1["Appliances"], color="coral", label="line")
ax1.set_ylabel('Appliances (Wh)');
In [95]:
plt.figure(figsize=(20, 7))
plt.ylabel('Appliances(wh)')
plt.xlabel('per month')

plt.plot(edc1.date,edc1.Appliances)
plt.xticks(rotation = -75)
Out[95]:
(array([735985., 735995., 736016., 736024., 736045., 736055., 736076.,
        736085., 736106., 736116.]), <a list of 10 Text xticklabel objects>)
In [96]:
week = edc1["date"].dt.week
first_week = edc1[week == min(week)]
first_week
plt.figure(figsize=(20,7))
plt.plot(first_week['date'],first_week['Appliances'])
Out[96]:
[<matplotlib.lines.Line2D at 0x1f1c083b400>]

Now we represent the hourly energy consumption of appliances as heat maps for four consecutive weeks (Week 3, Week 4, Week 5 and Week 6)

In [97]:
energy1 = pd.DataFrame()  # new dataframe; we will add to it all the fields required for the heat map
energy2 = edc1.groupby('Hour', as_index=False).agg({"Appliances": "sum"})
In [98]:
energy2.head()
Out[98]:
Hour Appliances
0 2016-01-11 17:00:00 330
1 2016-01-11 18:00:00 1060
2 2016-01-11 19:00:00 1040
3 2016-01-11 20:00:00 750
4 2016-01-11 21:00:00 620
In [99]:
energy1['Hour'] = edc1['Hour']
energy1['week'] = edc1['week']
energy1['Day']  = edc1['Day']
energy1['Time'] = edc1['Time']
In [100]:
energy1.head()
Out[100]:
Hour week Day Time
0 2016-01-11 17:00:00 2 0 17
1 2016-01-11 17:00:00 2 0 17
2 2016-01-11 17:00:00 2 0 17
3 2016-01-11 17:00:00 2 0 17
4 2016-01-11 17:00:00 2 0 17
In [101]:
energy3 =energy1.groupby('Hour',as_index=False).first()
In [102]:
energy3.head()
Out[102]:
Hour week Day Time
0 2016-01-11 17:00:00 2 0 17
1 2016-01-11 18:00:00 2 0 18
2 2016-01-11 19:00:00 2 0 19
3 2016-01-11 20:00:00 2 0 20
4 2016-01-11 21:00:00 2 0 21
In [103]:
energy1=pd.merge(energy3,energy2)
In [104]:
energy1.head()
Out[104]:
Hour week Day Time Appliances
0 2016-01-11 17:00:00 2 0 17 330
1 2016-01-11 18:00:00 2 0 18 1060
2 2016-01-11 19:00:00 2 0 19 1040
3 2016-01-11 20:00:00 2 0 20 750
4 2016-01-11 21:00:00 2 0 21 620
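The three-step pipeline above (sum Appliances per hour, take the first week/Day/Time per hour, then merge) can be collapsed into a single `groupby.agg` call. A sketch on toy data; the column names mirror the notebook's but the frame here is synthetic:

```python
import pandas as pd

toy = pd.DataFrame({
    "Hour": pd.to_datetime(["2016-01-11 17:00"] * 2 + ["2016-01-11 18:00"] * 2),
    "week": [2, 2, 2, 2],
    "Day":  [0, 0, 0, 0],
    "Time": [17, 17, 18, 18],
    "Appliances": [60, 60, 50, 50],
})

# One pass: sum Appliances per hour, keep the first week/Day/Time per hour
energy = toy.groupby("Hour", as_index=False).agg(
    {"week": "first", "Day": "first", "Time": "first", "Appliances": "sum"}
)
print(energy)
```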
In [105]:
week3=energy1[energy1.week == 3]
week3 = week3.drop(['Hour', 'week'], axis=1) #selecting only the required columns
In [106]:
week3.head()
Out[106]:
Day Time Appliances
151 0 0 270
152 0 1 260
153 0 2 260
154 0 3 250
155 0 4 270
In [107]:
week3 = week3.pivot("Time","Day","Appliances")

week4=energy1[energy1.week == 4]
week4 = week4.drop(['Hour', 'week'], axis=1) 
week4 = week4.pivot("Time","Day","Appliances") 

week5=energy1[energy1.week == 5]
week5 = week5.drop(['Hour', 'week'], axis=1) 
week5 = week5.pivot("Time","Day","Appliances")  

week6=energy1[energy1.week == 6]
week6 = week6.drop(['Hour', 'week'], axis=1)
week6 = week6.pivot("Time","Day","Appliances") 
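The four blocks above differ only in the week number, so the logic can live in one helper. This is a sketch: `make_week_pivot` is a hypothetical name, and the toy frame stands in for `energy1`:

```python
import pandas as pd

def make_week_pivot(energy, week_no):
    """Return an hour-by-day pivot of Appliances for one week."""
    wk = energy[energy["week"] == week_no].drop(["Hour", "week"], axis=1)
    return wk.pivot(index="Time", columns="Day", values="Appliances")

# Toy data: two days, two hours each, all in week 3
toy = pd.DataFrame({
    "Hour": pd.to_datetime(["2016-01-18 00:00", "2016-01-18 01:00",
                            "2016-01-19 00:00", "2016-01-19 01:00"]),
    "week": [3, 3, 3, 3],
    "Day":  [0, 0, 1, 1],
    "Time": [0, 1, 0, 1],
    "Appliances": [270, 260, 300, 310],
})

pivot = make_week_pivot(toy, 3)
print(pivot.shape)  # (2, 2)
```

With this helper, the heat maps for weeks 3 to 6 would each be a one-line call.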
In [108]:
plt.figure(figsize=(4, 9))
plt.title("Week-3 Heat Map")
ax = sns.heatmap(week3,cmap = "YlOrRd",annot=True, fmt="d", linewidths=.5)
In [109]:
plt.figure(figsize=(4, 9))
plt.title("Week-4 Heat Map")
ax = sns.heatmap(week4,cmap = "YlOrRd",annot=True, fmt="d", linewidths=.5)
In [110]:
plt.figure(figsize=(4, 9))
plt.title("Week-5 Heat Map")
ax = sns.heatmap(week5,cmap = "YlOrRd",annot=True, fmt="d", linewidths=.5)
In [111]:
plt.figure(figsize=(4, 9))
plt.title("Week-6 Heat Map")
ax = sns.heatmap(week6,cmap = "YlOrRd",annot=True, fmt="d", linewidths=.5)

In the research paper the heat maps were computed on the training data only, but here I have computed them on the full dataset

In [112]:
edc1.columns
Out[112]:
Index(['date', 'Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3',
       'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8',
       'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed',
       'Visibility', 'Tdewpoint', 'rv1', 'rv2', 'Month', 'Hour', 'week', 'Day',
       'Time'],
      dtype='object')
In [113]:
## each variable's distribution and its correlation with Appliances are shown below
In [114]:
import numpy as np  # ensure numpy is available as np for the cells below

def corr(x, y, **kwargs):
    coef = np.corrcoef(x, y)[0][1]  # Pearson correlation coefficient
    label = r'$\rho$ = ' + str(round(coef, 2))
    ax = plt.gca()
    ax.annotate(label, xy=(0.2, 0.95), size=20, xycoords=ax.transAxes)
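As a quick sanity check of the `np.corrcoef` call used inside `corr`, a perfectly linear relationship should give ρ = 1 and a negated one ρ = -1:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
coef_pos = np.corrcoef(x, 2 * x + 1)[0][1]   # exact positive linear relation
coef_neg = np.corrcoef(x, -x)[0][1]          # exact negative linear relation
print(round(coef_pos, 2), round(coef_neg, 2))  # 1.0 -1.0
```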
In [115]:
ss = edc1[['Appliances','lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3','RH_3' ]]
In [116]:
g = sns.pairplot(ss)
g.map_lower(corr)
g.map_upper(corr)
plt.show()
In [117]:
ss1 = edc1[['Appliances','T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7']]
In [118]:
g = sns.pairplot(ss1)
g.map_lower(corr)
g.map_upper(corr)
plt.show()
In [119]:
ss2 = edc1[['Appliances','RH_7', 'T8','RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg']]
In [120]:
g = sns.pairplot(ss2)
g.map_lower(corr)
g.map_upper(corr)
plt.show()
In [121]:
ss3 = edc1[['Appliances','RH_out', 'Windspeed','Visibility', 'Tdewpoint', 'rv1', 'rv2']]
In [122]:
g = sns.pairplot(ss3)
g.map_lower(corr)
g.map_upper(corr)
plt.show()
In [123]:
# model validation
In [124]:
train = pd.read_csv('training.csv', index_col='date', parse_dates=True)
In [125]:
test = pd.read_csv('testing.csv', index_col='date', parse_dates=True)
In [126]:
train = train.join(pd.get_dummies(train['Day_of_week']))
train.head()
Out[126]:
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 RH_4 ... NSM WeekStatus Day_of_week Friday Monday Saturday Sunday Thursday Tuesday Wednesday
date
2016-01-11 17:00:00 60 30 19.89 47.596667 19.2 44.790000 19.79 44.730000 19.000000 45.566667 ... 61200 Weekday Monday 0 1 0 0 0 0 0
2016-01-11 17:10:00 60 30 19.89 46.693333 19.2 44.722500 19.79 44.790000 19.000000 45.992500 ... 61800 Weekday Monday 0 1 0 0 0 0 0
2016-01-11 17:20:00 50 30 19.89 46.300000 19.2 44.626667 19.79 44.933333 18.926667 45.890000 ... 62400 Weekday Monday 0 1 0 0 0 0 0
2016-01-11 17:40:00 60 40 19.89 46.333333 19.2 44.530000 19.79 45.000000 18.890000 45.530000 ... 63600 Weekday Monday 0 1 0 0 0 0 0
2016-01-11 17:50:00 50 40 19.89 46.026667 19.2 44.500000 19.79 44.933333 18.890000 45.730000 ... 64200 Weekday Monday 0 1 0 0 0 0 0

5 rows × 38 columns

In [127]:
train = train.join(pd.get_dummies(train['WeekStatus']))
train.head()
Out[127]:
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 RH_4 ... Day_of_week Friday Monday Saturday Sunday Thursday Tuesday Wednesday Weekday Weekend
date
2016-01-11 17:00:00 60 30 19.89 47.596667 19.2 44.790000 19.79 44.730000 19.000000 45.566667 ... Monday 0 1 0 0 0 0 0 1 0
2016-01-11 17:10:00 60 30 19.89 46.693333 19.2 44.722500 19.79 44.790000 19.000000 45.992500 ... Monday 0 1 0 0 0 0 0 1 0
2016-01-11 17:20:00 50 30 19.89 46.300000 19.2 44.626667 19.79 44.933333 18.926667 45.890000 ... Monday 0 1 0 0 0 0 0 1 0
2016-01-11 17:40:00 60 40 19.89 46.333333 19.2 44.530000 19.79 45.000000 18.890000 45.530000 ... Monday 0 1 0 0 0 0 0 1 0
2016-01-11 17:50:00 50 40 19.89 46.026667 19.2 44.500000 19.79 44.933333 18.890000 45.730000 ... Monday 0 1 0 0 0 0 0 1 0

5 rows × 40 columns

In [128]:
test = test.join(pd.get_dummies(test['Day_of_week']))
test.head()
Out[128]:
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 RH_4 ... NSM WeekStatus Day_of_week Friday Monday Saturday Sunday Thursday Tuesday Wednesday
date
2016-01-11 17:30:00 50 40 19.890000 46.066667 19.200000 44.590000 19.79 45.000000 18.89 45.723333 ... 63000 Weekday Monday 0 1 0 0 0 0 0
2016-01-11 18:00:00 60 50 19.890000 45.766667 19.200000 44.500000 19.79 44.900000 18.89 45.790000 ... 64800 Weekday Monday 0 1 0 0 0 0 0
2016-01-11 18:40:00 230 70 19.926667 45.863333 19.356667 44.400000 19.79 44.900000 18.89 46.430000 ... 67200 Weekday Monday 0 1 0 0 0 0 0
2016-01-11 18:50:00 580 60 20.066667 46.396667 19.426667 44.400000 19.79 44.826667 19.00 46.430000 ... 67800 Weekday Monday 0 1 0 0 0 0 0
2016-01-11 19:30:00 100 10 20.566667 53.893333 20.033333 46.756667 20.10 48.466667 19.00 48.490000 ... 70200 Weekday Monday 0 1 0 0 0 0 0

5 rows × 38 columns

In [129]:
test = test.join(pd.get_dummies(test['WeekStatus']))
test.head()
Out[129]:
Appliances lights T1 RH_1 T2 RH_2 T3 RH_3 T4 RH_4 ... Day_of_week Friday Monday Saturday Sunday Thursday Tuesday Wednesday Weekday Weekend
date
2016-01-11 17:30:00 50 40 19.890000 46.066667 19.200000 44.590000 19.79 45.000000 18.89 45.723333 ... Monday 0 1 0 0 0 0 0 1 0
2016-01-11 18:00:00 60 50 19.890000 45.766667 19.200000 44.500000 19.79 44.900000 18.89 45.790000 ... Monday 0 1 0 0 0 0 0 1 0
2016-01-11 18:40:00 230 70 19.926667 45.863333 19.356667 44.400000 19.79 44.900000 18.89 46.430000 ... Monday 0 1 0 0 0 0 0 1 0
2016-01-11 18:50:00 580 60 20.066667 46.396667 19.426667 44.400000 19.79 44.826667 19.00 46.430000 ... Monday 0 1 0 0 0 0 0 1 0
2016-01-11 19:30:00 100 10 20.566667 53.893333 20.033333 46.756667 20.10 48.466667 19.00 48.490000 ... Monday 0 1 0 0 0 0 0 1 0

5 rows × 40 columns

In [130]:
### training the model on the train dataset and validating against the test dataset
In [131]:
c = train.columns
In [132]:
c = c.drop(['Day_of_week','WeekStatus','Weekend'])  # NB: c still contains 'Appliances' and omits 'Weekend', so it is not perfectly aligned with the X_train columns used later
In [133]:
c
Out[133]:
Index(['Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4',
       'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9',
       'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility',
       'Tdewpoint', 'rv1', 'rv2', 'NSM', 'Friday', 'Monday', 'Saturday',
       'Sunday', 'Thursday', 'Tuesday', 'Wednesday', 'Weekday'],
      dtype='object')
In [134]:
X_train = train.drop(['Appliances','Day_of_week','WeekStatus'], axis = 1)
Y_train = train['Appliances']
X_test = test.drop(['Appliances','Day_of_week','WeekStatus'], axis = 1)
Y_test = test['Appliances']
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
(14803, 37)
(14803,)
(4932, 37)
(4932,)
In [135]:
### Fitting the linear model 
In [136]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
model = LinearRegression()
model.fit(X_train, Y_train)
Out[136]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [137]:
def mean_absolute_percentage_error1(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
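`mean_absolute_percentage_error1` divides by `y_true`, which blows up if any target is zero (here the minimum Appliances load is 10 Wh, so it is safe, but a guarded version costs little). A sketch with a small epsilon floor on the denominator; `safe_mape` is a hypothetical helper, not part of the original notebook:

```python
import numpy as np

def safe_mape(y_true, y_pred, eps=1e-8):
    """MAPE with a floor on the denominator to avoid division by zero."""
    y_true, y_pred = np.array(y_true, dtype=float), np.array(y_pred, dtype=float)
    denom = np.maximum(np.abs(y_true), eps)
    return np.mean(np.abs((y_true - y_pred) / denom)) * 100

print(safe_mape([100, 200], [110, 180]))  # 10.0
```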
In [138]:
import math
Y_pred = model.predict(X_test)
print("r2 for test data is",r2_score(Y_test, Y_pred))
print("MSE for test data is",mean_squared_error(Y_test, Y_pred))
print("RMSE for test data is",math.sqrt(mean_squared_error(Y_test, Y_pred)))
print("MAE for test data is",mean_absolute_error(Y_test, Y_pred))
print("MAPE for test data is",mean_absolute_percentage_error1(Y_test, model.predict(X_test)))
r2 for test data is 0.15900712778677029
MSE for test data is 8681.847902503212
RMSE for test data is 93.17643426587654
MAE for test data is 51.984929609383876
MAPE for test data is 59.955125611475204
In [139]:
residuals = Y_test - Y_pred
plt.figure(figsize = (10,5))
plt.scatter(test.Appliances,residuals)
plt.xlabel("Appliances")
plt.ylabel("residuals")
Out[139]:
Text(0, 0.5, 'residuals')
In [140]:
Y_pred1 = model.predict(X_train)
print("r2 for train data is",r2_score(Y_train, Y_pred1))
print("MSE for train data is",mean_squared_error(Y_train, Y_pred1))
print("RMSE for train data is",math.sqrt(mean_squared_error(Y_train, Y_pred1)))
print("MAE for train data is",mean_absolute_error(Y_train, Y_pred1))
print("MAPE for train data is",mean_absolute_percentage_error1(Y_train, model.predict(X_train)))
r2 for train data is 0.17834376492372517
MSE for train data is 8687.278741530838
RMSE for train data is 93.20557248110671
MAE for train data is 53.138912335056666
MAPE for train data is 61.331174880927406
In [141]:
## Fitting the RFE
In [142]:
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

from sklearn.feature_selection import RFE  # RFE is used below but was not imported earlier

estimator = LinearRegression()
selector = RFE(estimator, n_features_to_select=30, step=1)
selector = selector.fit(X_train, Y_train)
y_predict = selector.predict(X_test)
In [143]:
print("r2 for test data is",r2_score(Y_test, y_predict))
print("MSE for test data is",mean_squared_error(Y_test, y_predict))
print("RMSE for test data is",math.sqrt(mean_squared_error(Y_test, y_predict)))
print("MAE for test data is",mean_absolute_error(Y_test, y_predict))
print("MAPE for test data is",mean_absolute_percentage_error1(Y_test, y_predict))
r2 for test data is 0.15613307381972408
MSE for test data is 8711.5177133065
RMSE for test data is 93.33551153396279
MAE for test data is 52.06869745566852
MAPE for test data is 60.02791138825917
In [144]:
residuals = Y_test - y_predict
plt.figure(figsize = (10,5))
plt.scatter(test.Appliances,residuals)
plt.xlabel("Appliances")
plt.ylabel("residuals")
Out[144]:
Text(0, 0.5, 'residuals')
In [145]:
Y_pred1 = selector.predict(X_train)
print("r2 for train data is",r2_score(Y_train, Y_pred1))
print("MSE for train data is",mean_squared_error(Y_train, Y_pred1))
print("RMSE for train data is",math.sqrt(mean_squared_error(Y_train, Y_pred1)))
print("MAE for train data is",mean_absolute_error(Y_train, Y_pred1))
print("MAPE for train data is",mean_absolute_percentage_error1(Y_train, selector.predict(X_train)))
r2 for train data is 0.17529678426596762
MSE for train data is 8719.494124514618
RMSE for train data is 93.37823153452103
MAE for train data is 53.187024809362434
MAPE for train data is 61.30939132819881

We get essentially the same result with plain linear regression and with recursive feature elimination on the linear model: the MSE is very high and the r2 value is low, so the model is not doing a good job.

In [146]:
## Fitting a Random Forest Regressor
In [147]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0) 
regressor.fit(X_train, Y_train)
Out[147]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)
In [148]:
Y_pred = regressor.predict(X_test)
print("r2 for test data is",r2_score(Y_test, Y_pred))
print("MSE for test data is",mean_squared_error(Y_test, Y_pred))
print("RMSE for test data is",math.sqrt(mean_squared_error(Y_test, Y_pred)))
print("MAE for test data is",mean_absolute_error(Y_test, Y_pred))
print("MAPE for test data is",mean_absolute_percentage_error1(Y_test, model.predict(X_test)))
r2 for test data is 0.5214962710813266
MSE for test data is 4939.7524432279
RMSE for test data is 70.28337245200959
MAE for test data is 32.845965125709654
MAPE for test data is 59.955125611475204
In [149]:
Y_pred1 = regressor.predict(X_train)
print("r2 for train data is",r2_score(Y_train, Y_pred1))
print("MSE for train data is",mean_squared_error(Y_train, Y_pred1))
print("RMSE for train data is",math.sqrt(mean_squared_error(Y_train, Y_pred1)))
print("MAE for train data is",mean_absolute_error(Y_train, Y_pred1))
print("MAPE for train data is",mean_absolute_percentage_error1(Y_train, regressor.predict(X_train)))
r2 for train data is 0.93768409789014
MSE for train data is 658.8590076335878
RMSE for train data is 25.668249017679173
MAE for train data is 12.245862325204351
MAPE for train data is 12.121591425611507
In [150]:
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

from sklearn.feature_selection import RFE  # harmless re-import if already done above

estimator = RandomForestRegressor(n_estimators = 100, random_state = 0)
selector = RFE(estimator, n_features_to_select=30, step=1)
selector = selector.fit(X_train, Y_train)
y_predict = selector.predict(X_test)
In [151]:
print("r2 for test data is",r2_score(Y_test, y_predict))
print("MSE for test data is",mean_squared_error(Y_test, y_predict))
print("RMSE for test data is",math.sqrt(mean_squared_error(Y_test, y_predict)))
print("MAE for test data is",mean_absolute_error(Y_test, y_predict))
print("MAPE for test data is",mean_absolute_percentage_error1(Y_test, y_predict))
r2 for test data is 0.5237355901592999
MSE for test data is 4916.635210867802
RMSE for test data is 70.11872225638315
MAE for test data is 32.871086780210874
MAPE for test data is 32.452728767290075
In [152]:
Y_pred1 = selector.predict(X_train)
print("r2 for train data is",r2_score(Y_train, Y_pred1))
print("MSE for train data is",mean_squared_error(Y_train, Y_pred1))
print("RMSE for train data is",math.sqrt(mean_squared_error(Y_train, Y_pred1)))
print("MAE for train data is",mean_absolute_error(Y_train, Y_pred1))
print("MAPE for train data is",mean_absolute_percentage_error1(Y_train, selector.predict(X_train)))
r2 for train data is 0.9379166869598652
MSE for train data is 656.3998696210227
RMSE for train data is 25.620301903393386
MAE for train data is 12.25234749712896
MAPE for train data is 12.114799113807585

We get essentially the same result with the plain Random Forest and with Random Forest combined with recursive feature elimination: the MSE is low and the r2 value is high, so this model is doing a much better job (though the gap between the train r2 of 0.94 and the test r2 of 0.52 suggests some overfitting).

In [ ]:
 

We now try to get the variable importance for the Random Forest

In [153]:
ranks = {}
from sklearn.preprocessing import MinMaxScaler  # MinMaxScaler is used below but was not imported earlier

# Create our function which stores the feature rankings, rescaled to [0, 1], in the ranks dictionary
def ranking(ranks, names, order=1):
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x,2), ranks)
    return dict(zip(names, ranks))
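`ranking()` rescales the raw scores to [0, 1] with `MinMaxScaler` (optionally flipping the sign with `order=-1` so that rank 1 maps to the best score). The same normalisation can be written directly in NumPy, which makes the intent clearer; `ranking_manual` is a hypothetical equivalent, not part of the original notebook:

```python
import numpy as np

def ranking_manual(scores, names, order=1):
    """Min-max rescale scores (optionally flipped by order=-1) to [0, 1]."""
    s = order * np.asarray(scores, dtype=float)
    s = (s - s.min()) / (s.max() - s.min())
    return dict(zip(names, np.round(s, 2)))

print(ranking_manual([1, 2, 3], ["a", "b", "c"], order=-1))
# rank 1 (best) -> 1.0, rank 3 (worst) -> 0.0
```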
In [154]:
lr = LinearRegression(normalize=True)
lr.fit(X_train,Y_train)
#stop the search when only the last feature is left
rfe = RFE(lr, n_features_to_select=1, verbose =3 )
rfe.fit(X_train,Y_train)
ranks["RFE"] = ranking(list(map(float, rfe.ranking_)), c, order=-1)
Fitting estimator with 37 features.
Fitting estimator with 36 features.
Fitting estimator with 35 features.
Fitting estimator with 34 features.
Fitting estimator with 33 features.
Fitting estimator with 32 features.
Fitting estimator with 31 features.
Fitting estimator with 30 features.
Fitting estimator with 29 features.
Fitting estimator with 28 features.
Fitting estimator with 27 features.
Fitting estimator with 26 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.
Fitting estimator with 2 features.
In [155]:
rf = RandomForestRegressor(n_jobs=-1, n_estimators=50, verbose=3)
rf.fit(X_train,Y_train)
ranks["RF"] = ranking(rf.feature_importances_, c);
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
building tree 1 of 50
...
building tree 50 of 50
[Parallel(n_jobs=-1)]: Done  44 out of  50 | elapsed:    1.7s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    1.9s finished
In [156]:
r = {}
for name in c:
    r[name] = round(np.mean([ranks[method][name] for method in ranks.keys()]), 2)
 
methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")
 
print("\t%s" % "\t".join(methods))
for name in c:
    print("%s\t%s" % (name, "\t".join(map(str, 
                         [ranks[method][name] for method in methods]))))
	RF	RFE	Mean
Appliances	0.16	0.44	0.3
lights	0.11	0.47	0.29
T1	0.17	1.0	0.58
RH_1	0.14	0.94	0.54
T2	0.24	0.97	0.6
RH_2	0.31	0.92	0.62
T3	0.28	0.72	0.5
RH_3	0.15	0.22	0.18
T4	0.17	0.03	0.1
RH_4	0.15	0.31	0.23
T5	0.2	0.06	0.13
RH_5	0.14	0.69	0.42
T6	0.15	0.14	0.15
RH_6	0.16	0.33	0.24
T7	0.16	0.39	0.28
RH_7	0.23	0.78	0.5
T8	0.16	0.75	0.46
RH_8	0.14	0.83	0.48
T9	0.15	0.28	0.22
RH_9	0.14	0.67	0.4
T_out	0.23	0.08	0.16
Press_mm_hg	0.16	0.25	0.2
RH_out	0.15	0.42	0.28
Windspeed	0.11	0.11	0.11
Visibility	0.14	0.36	0.25
Tdewpoint	0.1	0.17	0.14
rv1	0.09	0.19	0.14
rv2	1.0	0.0	0.5
NSM	0.02	0.81	0.42
Friday	0.02	0.89	0.46
Monday	0.01	0.86	0.44
Saturday	0.0	0.56	0.28
Sunday	0.0	0.61	0.3
Thursday	0.0	0.64	0.32
Tuesday	0.0	0.58	0.29
Wednesday	0.0	0.5	0.25
Weekday	0.0	0.53	0.26
In [157]:
# Put the mean scores into a Pandas dataframe
meanplot = pd.DataFrame(list(r.items()), columns= ['Feature','Mean Ranking'])

# Sort the dataframe
meanplot = meanplot.sort_values('Mean Ranking', ascending=False)
In [158]:
# Let's plot the ranking of the features
sns.factorplot(x="Mean Ranking", y="Feature", data = meanplot, kind="bar", 
               size=14, aspect=1.9, palette='coolwarm')
Out[158]:
<seaborn.axisgrid.FacetGrid at 0x1f1c82706d8>
In [ ]:
 

SUHASH REDDY IMMAREDDY - 45693242 - DATA-SCIENCE - PORTFOLIO-3

Portfolio 3 - Clustering Visualisation

K-means clustering is one of the simplest and most popular unsupervised learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known or labelled outcomes. This notebook illustrates the process of K-means clustering by generating some random clusters of data and then showing the iterations of the algorithm as random cluster means are updated.

We first generate random data around 4 centers.

In [159]:
import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt

%matplotlib inline
In [160]:
center_1 = np.array([1,2])
center_2 = np.array([6,6])
center_3 = np.array([9,1])
center_4 = np.array([-5,-1])

# Generate random data and center it to the four centers each with a different variance
np.random.seed(5)
data_1 = np.random.randn(200,2) * 1.5 + center_1
data_2 = np.random.randn(200,2) * 1 + center_2
data_3 = np.random.randn(200,2) * 0.5 + center_3
data_4 = np.random.randn(200,2) * 0.8 + center_4

data = np.concatenate((data_1, data_2, data_3, data_4), axis = 0)

plt.scatter(data[:,0], data[:,1], s=7, c='red')
plt.show()

1. Generate random cluster centres

You need to generate four random centres.

This part of the portfolio should contain at least:

  • The number of clusters k is set to 4;
  • Generate random centres via centres = np.random.randn(k,c)*std + mean where std and mean are the standard deviation and mean of the data. c represents the number of features in the data. Set the random seed to 6.
  • Color the generated centers with green, blue, yellow, and cyan. Set the edgecolors to red.
In [161]:
std = data.std()
mean = data.mean()
np.random.seed(5)

Calculating the standard deviation and mean of the data; with the random seed set to 5, these are used below to generate four random centres

In [162]:
k = 4
ss = ['green', 'blue', 'yellow', 'cyan']

Setting the number of clusters k to 4 and defining a colour for each cluster

In [163]:
centres = np.random.randn(k,2)*std + mean  ### centroids

generating the four random centers

In [164]:
plt.scatter(data[:,0], data[:,1], s=7, c='black')
plt.scatter(centres[:,0], centres[:,1],s=100, c=ss)
plt.show()

Plotting the four random centres together with the random data that we generated

In [165]:
print(data)
[[ 1.66184123  1.50369477]
 [ 4.64615678  1.62186181]
 [ 1.16441476  4.37372168]
 ...
 [-5.94762563  0.05925507]
 [-5.5282781  -0.16683908]
 [-5.02162618 -0.15647292]]
In [166]:
d1 = pd.DataFrame({
    'X_value': data[:, 0],
    'Y_value': data[:, -1]
})
In [167]:
print(d1.head())
    X_value   Y_value
0  1.661841  1.503695
1  4.646157  1.621862
2  1.164415  4.373722
3 -0.363849  1.112545
4  1.281405  1.505195
In [168]:
print(centres)
[[ 4.29560355  0.96057399]
 [12.88931852  1.30085095]
 [ 2.86320097  9.22518032]
 [-1.53762715 -0.16579133]]

2. Visualise the clustering results in each iteration

You need to implement the process of k-means clustering. Implement each iteration as a separate cell: assign each data point to the closest centre, then update the cluster centres based on the data, then plot the new clusters.

K-means alternates two steps until the assignments stop changing: an assignment step, in which every point is given to its nearest centroid by Euclidean distance, and an update step, in which every centroid moves to the mean of the points assigned to it. Each pass can only reduce the total within-cluster distance, so the procedure converges; the cells below demonstrate these iterations one at a time.

In [169]:
centroids = {
    1: [centres[0][0], centres[0][1]], 2: [centres[1][0], centres[1][1]], 3: [centres[2][0], centres[2][1]], 4: [centres[3][0], centres[3][1]]
    }

Converting the random centres into a dictionary of key-value pairs (cluster id -> centre coordinates)

In [170]:
# Assignment Stage
def assignment(df, centroids):
    for i in centroids.keys():
        # Euclidean distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
        df['distance_from_{}'.format(i)] = (
            np.sqrt(
                (df['X_value'] - centroids[i][0]) ** 2
                + (df['Y_value'] - centroids[i][1]) ** 2
            )
        )
    colmap = {1: 'green', 2: 'blue', 3: 'yellow', 4: 'cyan'}
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    df['closest'] = df.loc[:, centroid_distance_cols].idxmin(axis=1)
    # str.lstrip strips a *set* of characters, not a prefix; splitting on '_' is safer
    df['closest'] = df['closest'].map(lambda x: float(x.split('_')[-1]))
    df['color'] = df['closest'].map(lambda x: colmap[x])
    return df

Using the Euclidean distance sqrt((x - x1)^2 + (y - y1)^2), we calculate, for each point, which of the generated centroids is closest
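The per-centroid loop in `assignment` computes one distance column at a time; the full point-to-centroid distance matrix can also be computed in a single broadcasted NumPy expression. A sketch, independent of the notebook's dataframe:

```python
import numpy as np

points = np.array([[0.0, 0.0], [3.0, 4.0]])               # 2 points
centres = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])  # 3 centres

# (n_points, 1, 2) - (1, n_centres, 2) broadcasts to (n_points, n_centres, 2)
dists = np.sqrt(((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2))
closest = dists.argmin(axis=1)
print(dists.round(1))
print(closest)  # [0 1]
```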

In [171]:
df = assignment(d1, centroids)
print(df.head())
sss =[]
for i in range(0,800):
    sss.append(df['color'][i])
    X_value   Y_value  distance_from_1  distance_from_2  distance_from_3  \
0  1.661841  1.503695         2.689179        11.229310         7.814384   
1  4.646157  1.621862         0.748458         8.249410         7.809570   
2  1.164415  4.373722         4.631838        12.120887         5.140285   
3 -0.363849  1.112545         4.661930        13.254505         8.730905   
4  1.281405  1.505195         3.063006        11.609712         7.880371   

   distance_from_4  closest  color  
0         3.608848      1.0  green  
1         6.436994      1.0  green  
2         5.282822      1.0  green  
3         1.735483      4.0   cyan  
4         3.277062      1.0  green  

Plotting the 800 random points, each coloured according to its closest centroid

In [172]:
fig = plt.figure(figsize=(5, 5))
plt.scatter(df['X_value'], df['Y_value'], color=df['color'], alpha=0.5, edgecolor='k')
colmap = {1: 'green', 2: 'blue', 3: 'yellow', 4: 'cyan'}
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
In [173]:
# update stage
import copy

old_centroids = copy.deepcopy(centroids)

def update(centroids):
    # move each centroid to the mean of the points currently assigned to it
    for i in centroids.keys():
        centroids[i][0] = np.mean(df[df['closest'] == i]['X_value'])
        centroids[i][1] = np.mean(df[df['closest'] == i]['Y_value'])
    return centroids

centroids = update(centroids)

We recalculate each centroid as the mean of the points currently assigned to it, updating the centroid positions
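A minimal sketch of this update step, on a hypothetical mini-frame with the same `X_value` / `Y_value` / `closest` columns as the notebook's `df`:

```python
import pandas as pd

# Hypothetical mini-frame: four points, two clusters (labelled 1 and 2).
df = pd.DataFrame({
    'X_value': [1.0, 2.0, 5.0, 7.0],
    'Y_value': [1.0, 3.0, 5.0, 7.0],
    'closest': [1, 1, 2, 2],
})

# Update step: each centroid moves to the mean of its assigned points.
centroids = {
    i: [df.loc[df['closest'] == i, 'X_value'].mean(),
        df.loc[df['closest'] == i, 'Y_value'].mean()]
    for i in (1, 2)
}
print(centroids)  # {1: [1.5, 2.0], 2: [6.0, 6.0]}
```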

In [174]:
suhash = 2  # iteration counter; two assignment/update iterations were already performed above
In [175]:
fig = plt.figure(figsize=(20, 10))
ax = plt.axes()
plt.scatter(df['X_value'], df['Y_value'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])

for i in old_centroids.keys():
    old_x = old_centroids[i][0]
    old_y = old_centroids[i][1]
    dx = (centroids[i][0] - old_centroids[i][0]) * 0.75
    dy = (centroids[i][1] - old_centroids[i][1]) * 0.75
    ax.arrow(old_x, old_y, dx, dy, head_width=0, head_length=0.75, fc=colmap[i], ec=colmap[i])
plt.show()

This is the first iteration: based on the coloured groups we calculate the new centroid for each group, and the arrows show how each centroid has moved

In [176]:
#Repeat Assignment Stage
df = assignment(df, centroids)

# Plot results
fig = plt.figure(figsize=(20, 10))
plt.scatter(df['X_value'], df['Y_value'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i])
plt.show()

The figure above shows the second iteration of the k-means clustering; some of the surrounding points have changed colour, meaning they were reassigned to a different centroid

In [177]:
# Continue until the cluster assignments no longer change
while True:
    closest_centroids = df['closest'].copy(deep=True)
    centroids = update(centroids)
    df = assignment(df, centroids)
    suhash += 1
    if closest_centroids.equals(df['closest']):
        break

fig = plt.figure(figsize=(20, 10))
plt.scatter(df['X_value'], df['Y_value'], color=df['color'], alpha=0.5, edgecolor='k')
for i in centroids.keys():
    plt.scatter(*centroids[i], color=colmap[i], s=100)
plt.show()

print("\n")
print("It took " + str(suhash) + " iterations for the centroid positions to become constant")

It took 6 iterations for the centroid positions to become constant

It took a total of six iterations for the centroids to become constant, i.e. for the algorithm to converge
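The whole loop, assignment then update repeated until the labels stop changing, can be sketched in plain NumPy (toy data here, not the notebook's 800 points):

```python
import numpy as np

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate assignment and update until labels stop changing."""
    labels = None
    for n_iter in range(1, max_iter + 1):
        # assignment step: nearest centroid by Euclidean distance
        dists = np.sqrt(((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(labels, new_labels):
            break  # assignments unchanged -> converged
        labels = new_labels
        # update step: move each centroid to the mean of its points
        for k in range(len(centroids)):
            if (labels == k).any():
                centroids[k] = points[labels == k].mean(axis=0)
    return centroids, labels, n_iter

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
init = np.array([[0.0, 0.5], [4.0, 4.0]])
centers, labels, iters = kmeans(pts, init)
print(centers, labels, iters)
```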

In [178]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs  # 'sklearn.datasets.samples_generator' was removed in newer scikit-learn releases
from sklearn.cluster import KMeans
In [179]:
ssss = ["cyan","blue","green", "yellow"]
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(data)
plt.scatter(data[:,0], data[:,1])
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c=ssss)
plt.show()
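scikit-learn's KMeans chooses centroids that minimise the within-cluster sum of squared distances, exposed as the `inertia_` attribute after fitting; a NumPy sketch of that quantity, on hypothetical points, labels and centers:

```python
import numpy as np

# Hypothetical clustering result: points, integer labels, and cluster centers.
points = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 5.0], [5.0, 7.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [5.0, 6.0]])

# Inertia: sum of squared distances from each point to its assigned center.
inertia = ((points - centers[labels]) ** 2).sum()
print(inertia)  # 4.0
```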
In [1]:
import json
import os
In [2]:
notebooks_to_merge = [file for file in os.listdir(os.getcwd()) if file.endswith('.ipynb')]
notebooks_to_merge.sort()
print(notebooks_to_merge)
['Portfolio1.ipynb', 'Portfolio2.ipynb', 'Portfolio3.ipynb']
In [8]:
def combine_ipynb_files(list_of_notebooks, combined_file_name):
    # load the first notebook, then append the cells of each remaining notebook
    with open(list_of_notebooks[0], mode='r', encoding='utf-8') as f:
        merged = json.load(f)
    for notebook in list_of_notebooks[1:]:
        with open(notebook, mode='r', encoding='utf-8') as f:
            merged['cells'].extend(json.load(f)['cells'])
    with open(combined_file_name, mode='w', encoding='utf-8') as f:
        json.dump(merged, f)
    print('Generated file: "{}".'.format(combined_file_name))
    return os.path.realpath(combined_file_name)

combine_ipynb_files(notebooks_to_merge, "merged.ipynb")
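The merge itself is just list concatenation on the `'cells'` key of the notebook JSON; a self-contained sketch on two toy notebook dicts (hypothetical, not the portfolio files):

```python
import json

# Two toy notebook documents in the minimal nbformat JSON shape:
# a dict whose 'cells' key holds a list of cell dicts.
nb_a = {'cells': [{'cell_type': 'code', 'source': ['print(1)']}],
        'nbformat': 4, 'nbformat_minor': 2, 'metadata': {}}
nb_b = {'cells': [{'cell_type': 'markdown', 'source': ['# Part 2']}],
        'nbformat': 4, 'nbformat_minor': 2, 'metadata': {}}

# Merging: keep the first notebook's metadata, concatenate the cell lists,
# then serialise the combined dict back to JSON.
merged = dict(nb_a)
merged['cells'] = nb_a['cells'] + nb_b['cells']
text = json.dumps(merged)
print(len(merged['cells']))  # 2
```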
In [ ]: